Question: “How have COVID hotspots changed over time?”
We are interested in understanding the spatial and temporal characteristics of the spread of COVID-19 in the contiguous US, e.g “Where and when do COVID cases and deaths increase or decrease?” To that end, we utilize two approaches to investigate this hypothesis.
library(tidyverse)
library(lubridate)
JHUdata_cases <- read_csv("https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")
JHUdata_deaths <- read_csv("https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv")
JHUdata_deaths %>% head()
JHUdata_cases %>% head()
pivot_longer with a regex to collapse all the date columns into a date variable.chr to date.group_by and summarize, since we are interested first in state-level trends.
cumul for cumulative state-level data.cases <- JHUdata_cases %>%
pivot_longer(matches("[0-9]+/[0-9]+/[0-9]+"),names_to="date",values_to="cumul") %>%
rename(c("state"="Province_State")) %>%
mutate(date=mdy(date)) %>%
group_by(date,state) %>% summarize(cumul=sum(cumul))
deaths <- JHUdata_deaths %>%
pivot_longer(matches("[0-9]+/[0-9]+/[0-9]+"),names_to="date",values_to="cumul") %>%
rename(c("state"="Province_State")) %>%
mutate(date=mdy(date)) %>%
group_by(date,state) %>% summarize(cumul=sum(cumul))
dplyr lag didn’t function properly and returned NAs. So, an in-house function to compute first-order differences was written.first_diff simply iterates through a time series and appends the difference to the correct slot in a vector.add_firstdiff is meant to take a national dataframe and get the 1st order difference for each state.
reduce and full_join.# Cumulative plot!
# Not what we want.
cases %>% filter(state=="Virginia") %>%
ggplot(aes(x=date,y=cumul)) +
geom_line()
first_diff <- function(arr){
to_return <- NULL
for (i in 2:length(arr)){
to_return[i] <- arr[i] - arr[i-1]
}
return(to_return)
}
add_firstdiff <- function(df){
unique_states <- unique(df$state)
dfs <- list()
for (this_state in unique_states){
thisdf <- df %>% filter(state==this_state)
thisdf$diff <- first_diff(thisdf$cumul)
dfs[[length(dfs) + 1]] <- thisdf
}
return(dfs %>% reduce(full_join))
}
firstdiff functions are applied to the case and deaths dataframes to get the new cases and new deaths per day.
left_join to get a master dataframe.state variable is converted to lower-case to facilitate joining with GeoJSON data later.cases <- add_firstdiff(cases) %>%
select(date,state,diff) %>%
rename(c("positiveIncrease"="diff"))
deaths <- add_firstdiff(deaths) %>%
select(date,state,diff) %>%
rename(c("deathIncrease"="diff"))
data <- left_join(cases,deaths,by=c("date","state")) %>%
mutate(state=tolower(state)) %>%
filter(date >= date("2020-02-01"))
data <- data %>% filter(state %in% tolower(state.name))
# Our difference function worked!
cases %>% filter(state=="Virginia") %>%
ggplot(aes(x=date,y=positiveIncrease)) +
geom_line()
data %>% ggplot(aes(x=date,y=positiveIncrease)) + geom_line(aes(color=state),alpha=0.3)
seq to interpolate between the earliest date and the latest date by months.intervals.dates <- seq(min(data$date),max(data$date),by="months")
dates <- dates %>% append(max(data$date))
intervals <- list()
for (i in 2:length(dates)){
intervals[[i-1]] <- c(dates[i-1],dates[i])
}
states_map <- map_data("state")
paste function so we know which date interval it is.heatmap_for_givenday_deathIncrease <- function(interval){
cumul_increase <- data %>%
filter(between(date,interval[1],interval[2])) %>%
group_by(state) %>%
summarize(total=sum(deathIncrease))
#cumul_increase$statename <- tolower(state.name[match(cumul_increase$state,state.abb)])
loc_data <- left_join(states_map,cumul_increase,by=c("region"="state"))
# Used to see the max value in order to set the gradient.
#print(max(cumul_increase$total))
p <- loc_data %>%
ggplot() +
geom_polygon(aes(x=long,y=lat,group=group,fill=total),color="grey") +
scale_fill_gradient(low="white",high="red", limits=c(0,22136)) +
ggtitle(paste("New deaths: ",interval[1]," to ",interval[2])) +
theme(axis.title=element_blank(),
axis.ticks=element_blank(),
panel.grid=element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank())
return(p)
}
heatmap_for_givenday_positiveIncrease <- function(interval){
cumul_increase <- data %>%
filter(between(date,interval[1],interval[2])) %>%
group_by(state) %>%
summarize(total=sum(positiveIncrease))
#cumul_increase$statename <- tolower(state.name[match(cumul_increase$state,state.abb)])
loc_data <- left_join(states_map,cumul_increase,by=c("region"="state"))
#loc_data <- left_join(states_map,cumul_increase,by=c("region"="statename"))
# Used to see the max value in order to set the gradient.
#print(max(cumul_increase$total))
p <- loc_data %>%
ggplot() +
geom_polygon(aes(x=long,y=lat,group=group,fill=total),color="grey") +
#scale_fill_gradient(low="white",high="red",limits=c(0,1259607)) +
scale_fill_gradient(low="white",high="red",limits=c(0,1259607)) +
ggtitle(paste("New cases: ",interval[1]," to ",interval[2])) +
theme(axis.title=element_blank(),
axis.ticks=element_blank(),
panel.grid=element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank())
return(p)
}
map from purrr to apply our function readily across our list of intervals, making the process of generating each plot very straightforward.
plots <- map(intervals,heatmap_for_givenday_deathIncrease)
plots <- map(intervals,heatmap_for_givenday_positiveIncrease)
plots[[12]]
for (i in 1:length(plots)){
ggsave(plot=plots[[i]],path="../project/out",filename=paste("cases",i,".png",sep=""),width=5,height=4,device="png")
}
ffmpeg was then used to join the images together into a video.
ffmpeg -r 1 -f image2 -s 1920x1080 -i ../deaths/project/out/cases/cases%d.png -vcodec libx264 -crf 25 -pix_fmt yuv420p test.mp4
The videos cannot be embedded into a PDF,
This whole procedure can also be replicated for each region of the US, giving us a better idea of regional trends as well.
us_regions <- read_csv("https://raw.githubusercontent.com/cphalpert/census-regions/master/us%20census%20bureau%20regions%20and%20divisions.csv")
data <- left_join(cases,deaths) %>%
filter(date >= date("2020-02-01"))
data <- left_join(data,us_regions,by=c("state"="State"))
upbound argument is added to make it easier to set the absolute scale of the gradient.heatmap_for_givenday_positiveIncrease <- function(interval,region,upbound){
cumul_increase <- data %>%
filter(Region == region) %>%
filter(between(date,interval[1],interval[2])) %>%
group_by(state) %>% #added when JHU data used
mutate(state=tolower(state)) %>%
summarize(total=sum(positiveIncrease))
#cumul_increase$statename <- tolower(state.name[match(cumul_increase$state,state.abb)])
#loc_data <- inner_join(states_map,cumul_increase,by=c("region"="statename"))
loc_data <- inner_join(states_map,cumul_increase,by=c("region"="state"))
print(max(cumul_increase$total))
p <- loc_data %>%
ggplot() +
geom_polygon(aes(x=long,y=lat,group=group,fill=total),color="grey") +
scale_fill_gradient(low="white",high="red",limits=c(0,upbound)) +
ggtitle(paste(region,"- New cases: ",interval[1]," to ",interval[2])) +
theme(axis.title=element_blank(),
axis.ticks=element_blank(),
panel.grid=element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank())
return(p)
}
heatmap_for_givenday_deathIncrease <- function(interval,region,upbound){
cumul_increase <- data %>%
filter(Region == region) %>%
filter(between(date,interval[1],interval[2])) %>%
group_by(state) %>%
mutate(state=tolower(state)) %>%
summarize(total=sum(deathIncrease))
#cumul_increase$statename <- tolower(state.name[match(cumul_increase$state,state.abb)])
#loc_data <- inner_join(states_map,cumul_increase,by=c("region"="statename"))
loc_data <- inner_join(states_map,cumul_increase,by=c("region"="state"))
print(max(cumul_increase$total))
p <- loc_data %>%
ggplot() +
geom_polygon(aes(x=long,y=lat,group=group,fill=total),color="grey") +
scale_fill_gradient(low="white",high="red",limits=c(0,upbound)) +
ggtitle(paste(region,"- New deaths: ",interval[1]," to ",interval[2])) +
theme(axis.title=element_blank(),
axis.ticks=element_blank(),
panel.grid=element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank())
return(p)
}
plots <- map(intervals,heatmap_for_givenday_positiveIncrease,"South",639843)
plots <- map(intervals,heatmap_for_givenday_positiveIncrease,"West",1259607)
plots <- map(intervals,heatmap_for_givenday_positiveIncrease,"Northeast",450056)
plots <- map(intervals,heatmap_for_givenday_positiveIncrease,"Midwest",322299)
plots <- map(intervals,heatmap_for_givenday_deathIncrease,"South",10159)
plots <- map(intervals,heatmap_for_givenday_deathIncrease,"West",15922)
plots <- map(intervals,heatmap_for_givenday_deathIncrease,"Northeast",22136)
plots <- map(intervals,heatmap_for_givenday_deathIncrease,"Midwest",6538)
plots[[12]]
for (i in 1:length(plots)){
ggsave(plot=plots[[i]],path="../project/out",filename=paste("northeast_cases",i,".png",sep=""),width=5,height=4,device="png")
}
# Make analysis reproducible
set.seed(2)
library(tidyverse)
library(lubridate)
library(cluster)
library(rvest)
library(patchwork)
JHUdata_cases <- read_csv("https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")
JHUdata_deaths <- read_csv("https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv")
first_diff <- function(arr){
to_return <- NULL
for (i in 2:length(arr)){
to_return[i] <- arr[i] - arr[i-1]
}
return(to_return)
}
add_firstdiff <- function(df){
unique_states <- unique(df$state)
dfs <- list()
for (this_state in unique_states){
thisdf <- df %>% filter(state==this_state)
thisdf$diff <- first_diff(thisdf$cumul)
dfs[[length(dfs) + 1]] <- thisdf
}
return(dfs %>% reduce(full_join))
}
cases <- JHUdata_cases %>% pivot_longer(matches("[0-9]+/[0-9]+/[0-9]+"),names_to="date",values_to="cumul") %>%
rename(c("state"="Province_State")) %>%
mutate(date=mdy(date)) %>%
group_by(date,state) %>% summarize(cumul=sum(cumul))
deaths <- JHUdata_deaths %>% pivot_longer(matches("[0-9]+/[0-9]+/[0-9]+"),names_to="date",values_to="cumul") %>%
rename(c("state"="Province_State")) %>%
mutate(date=mdy(date)) %>%
group_by(date,state) %>% summarize(cumul=sum(cumul))
cases <- add_firstdiff(cases) %>% select(date,state,diff) %>% rename(c("positiveIncrease"="diff"))
deaths <- add_firstdiff(deaths) %>% select(date,state,diff) %>% rename(c("deathIncrease"="diff"))
data <- left_join(cases,deaths) %>%
ungroup() %>%
filter(state %in% state.name) %>%
filter(state != "Alaska") %>%
filter(state != "Hawaii")
data %>% head()
data %>% ggplot(aes(x=date,y=positiveIncrease)) + geom_line(aes(color=state),alpha=0.3)
content <- read_html("https://inkplant.com/code/state-latitudes-longitudes")
states_latlong <- html_table(content,header=T)
states_latlong <- as_tibble(data.frame(states_latlong))
states_latlong <- states_latlong %>% rename(c("state"="State","lat"="Latitude","long"="Longitude"))
# Check how the states distribute
# Looks like the capital cities of each state, that's fine
states_map <- map_data("state")
ggplot(states_map, aes(long, lat)) +
geom_polygon(aes(group = group),fill = "white", colour = "black") +
geom_point(data=states_latlong)
data <- data %>% left_join(states_latlong)
data %>% head()
Now, for clustering. We are going to use the default Euclidean distance as our similarity metric, as we are going to cluster on date/time, new cases, new deaths, and latitude and longitude. Euclidean distance makes sense for these continuous values.
However, we need to scale or normalize each feature before we use k-means. This is done so that the differing units of the features won’t affect the similarity between 2 observations. Thus, min-max normalization was used to scale the features.
Min-max: To scale a vector \(X\), \(x_{i_{scaled}} = \frac{x_i - min(X)}{max(X) - min(X)}\) for each \(x_i\) in \(X\).
# Function to normalize a vector
norm <- function(my_vec){
my_max <- max(my_vec)
my_min <- min(my_vec)
if (my_max-my_min ==0){
print("err: divide by zero")
}
return((my_vec - my_min)/(my_max-my_min))
}
temp <- data %>% filter(!is.na(positiveIncrease) & !is.na(deathIncrease))
cor(temp$positiveIncrease,temp$deathIncrease)
## [1] 0.6295024
# Prepare the data for clustering, i.e
# remove NAs,
# and normalize features
cluster_deaths <- data %>%
filter(
!is.na(positiveIncrease) &
!is.na(deathIncrease) &
!is.na(lat) &
!is.na(long)) %>%
mutate(date=as.integer(date)) %>%
mutate(date=norm(date)) %>%
mutate(
positiveIncrease=norm(positiveIncrease),
deathIncrease=norm(deathIncrease),
lat=norm(lat),
long=norm(long)
) %>% select(-state,-positiveIncrease)
cluster_cases <- data %>%
filter(
!is.na(positiveIncrease) &
!is.na(deathIncrease) &
!is.na(lat) &
!is.na(long)) %>%
mutate(date=as.integer(date)) %>%
mutate(date=norm(date)) %>%
mutate(
positiveIncrease=norm(positiveIncrease),
deathIncrease=norm(deathIncrease),
lat=norm(lat),
long=norm(long)
) %>% select(-state,-deathIncrease)
cluster_deaths %>% head()
cluster_cases %>% head()
# Generate an elbow plot to determine good value of k
plot_elbow <- function(data,up_to){
elbow <- unlist(map(1:up_to, function(k){
clusters <- kmeans(data,k,50)
return (clusters$tot.withinss)
}))
data.frame(k=1:up_to,elbow) %>% ggplot(aes(x=k,y=elbow)) +
geom_line() + geom_point(size=5) +
scale_x_continuous(breaks=1:20)
}
plot_elbow(cluster_deaths,15) + ggtitle("Cases") +
plot_elbow(cluster_cases,15) + ggtitle("Deaths")
clusters_deaths_3 <- kmeans(cluster_deaths,centers=3,50)
clusters_cases_3 <- kmeans(cluster_cases,centers=3,50)
clusters_deaths_6 <- kmeans(cluster_deaths,centers=6,50)
clusters_cases_6 <- kmeans(cluster_cases,centers=6,50)
# Add the cluster IDs to the data.
add_clust_nums <- function(clust_out){
data_w_clust <- data %>%
filter(
!is.na(positiveIncrease) &
!is.na(deathIncrease) &
!is.na(lat) &
!is.na(long))
data_w_clust$clust_num <- clust_out$cluster
return(data_w_clust)
}
wclust_deaths_3 <- add_clust_nums(clusters_deaths_3)
wclust_deaths_6 <- add_clust_nums(clusters_deaths_6)
wclust_cases_3<- add_clust_nums(clusters_cases_3)
wclust_cases_6 <- add_clust_nums(clusters_cases_6)
days <- seq(min(data$date),max(data$date),by="months")
days[length(days)+1] <- max(data$date)
colors <- c("1"="#E69F00", "2" = "#56B4E9", "3" = "#009E73", "4" = "#F0E442", "5" ="#0072B2", "6" ="#D55E00")
plot_clust_timeseries <- function(w_clust_i,name){
loc_data <- w_clust_i %>% group_by(date,clust_num) %>%
summarize(cases=sum(positiveIncrease), deaths=sum(deathIncrease))
p1 <- loc_data %>% ggplot(aes(x=date,y=cases)) +
geom_smooth(aes(color=factor(clust_num))) + ggtitle(paste("Time series of cases by cluster:",name)) +
scale_x_continuous(breaks=days[c(T,F)]) +
scale_color_manual(values=colors)
p2 <- loc_data %>% ggplot(aes(x=date,y=deaths)) +
geom_smooth(aes(color=factor(clust_num))) + ggtitle(paste("Time series of deaths by cluster:",name)) +
scale_x_continuous(breaks=days[c(T,F)]) +
scale_color_manual(values=colors)
return(p1 / p2)
}
plot_clust_timeseries(wclust_deaths_3,"deaths_3")
plot_clust_timeseries(wclust_deaths_6,"deaths_6")
plot_clust_timeseries(wclust_cases_3,"cases_3")
plot_clust_timeseries(wclust_cases_6,"cases_6")
- Within-cluster trends
wclust_cases_3 %>% ggplot(aes(x=date,y=positiveIncrease,color=state,alpha=0.2)) + geom_smooth() + facet_wrap(~clust_num) + ggtitle("Within-cluster time-series, increase in positive cases: cases_3")
wclust_cases_3 %>% ggplot(aes(x=date,y=deathIncrease,color=state,alpha=0.2)) + geom_smooth() + facet_wrap(~clust_num) + ggtitle("Within-cluster time-series, increase in deaths: cases_3")
wclust_cases_6 %>% ggplot(aes(x=date,y=positiveIncrease,color=state,alpha=0.2)) + geom_smooth() + facet_wrap(~clust_num) + ggtitle("Within-cluster time-series, increase in positive cases: cases_6")
wclust_cases_6 %>% ggplot(aes(x=date,y=deathIncrease,color=state,alpha=0.2)) + geom_smooth() + facet_wrap(~clust_num) + ggtitle("Within-cluster time-series, increase in deaths: cases_6")
wclust_deaths_3 %>% ggplot(aes(x=date,y=positiveIncrease,color=state,alpha=0.2)) + geom_smooth() + facet_wrap(~clust_num) + ggtitle("Within-cluster time-series, increase in positive cases: deaths_3")
wclust_deaths_3 %>% ggplot(aes(x=date,y=deathIncrease,color=state,alpha=0.2)) + geom_smooth() + facet_wrap(~clust_num) + ggtitle("Within-cluster time-series, increase in deaths: deaths_3")
wclust_deaths_6 %>% ggplot(aes(x=date,y=positiveIncrease,color=state,alpha=0.2)) + geom_smooth() + facet_wrap(~clust_num) + ggtitle("Within-cluster time-series, increase in positive cases: deaths_6")
wclust_deaths_6 %>% ggplot(aes(x=date,y=deathIncrease,color=state,alpha=0.2)) + geom_smooth() + facet_wrap(~clust_num) + ggtitle("Within-cluster time-series, increase in deaths: deaths_6")
states_map <- map_data("state")
cluster_plot_for_day <- function(day,data_w_clust,name){
temp <- data_w_clust %>% filter(date==day) %>%
mutate(state=tolower(state)) %>% select(-lat,-long)
loc_data <- left_join(states_map,temp,by=c("region"="state"))
loc_data %>%
ggplot() +
geom_polygon(aes(x=long,y=lat,group=group,fill=factor(clust_num)),color="grey") +
scale_fill_manual(values=colors) +
ggtitle(paste("Location of clusters",day,name,sep=" - ")) +
theme(axis.title=element_blank(),
axis.ticks=element_blank(),
panel.grid=element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank())
}
cluster_plot_for_day("2020-07-01",wclust_cases_3,"cases_3")
cluster_plot_for_day("2021-01-01",wclust_cases_3,"cases_3")
cluster_plot_for_day("2020-07-01",wclust_cases_6,"cases_6")
cluster_plot_for_day("2021-01-01",wclust_cases_6,"cases_6")
cluster_plot_for_day("2020-07-01",wclust_deaths_3,"cases_3")
cluster_plot_for_day("2021-01-01",wclust_deaths_3,"deaths_3")
cluster_plot_for_day("2020-07-01",wclust_deaths_6,"deaths_6")
cluster_plot_for_day("2021-01-01",wclust_deaths_6,"deaths_6")
At the national level, the spread of COVID is indeed generally characterized a spread from East coast/Northeast states and West coast states through the entire country. Large spikes of cases continue to be concentrated in larger, populous states like New York and California, although strong increases are also observed in Southern and Midwestern states.
Both cases and deaths began to rise in California, Texas, and Florida around 2020-08 to 2020-09. By this point, almost all states are seeing new cases. There is a small decrease in the new number of cases in these 3 states from 2020-09 to 2020-11.
There is a dramatic, almost uniform increase in cases across the entire nation from 2020-11 through 2021-03, Some of the largest increases (Roughly in descending order), are in California, Texas, New York, Florida, Illinois, Ohio, Pennsylvania. Also of note are Arizona, Georgia, North Carolina, and Tennessee. Almost all of these increases are accompanied with roughly proportional rises in deaths.
Region by region,
Midwest: - Illinois, Michigan see the worst spike in deaths from 2020-05 to 2020-06. - Coinciding with the nationwide increase from 2020-11 to 2021-01, Illinois, Ohio, Michigan, and Indiana see large rises in cases and deaths. The rest of the Midwest (Excluding North Dakota, South Dakota, Nebraska) all see a similar rise in cases to each other (that is less than that of Ohio and Illinois). - Cases decrease in almost all Midwestern states from 03-2021 to 04-2021, except for Michigan.
Northeast: - Corresponding to the national map, New York sees considerable rises in cases and deaths. - when New York rises in cases, it is typically accompanied by a rise in its adjacent states, like Pennsylvania, New Jersey, and Massachusetts. - Compared to the rest of the Northeast, Maine, New Hampshire, and Vermont do not see as large (raw count) of a rise in cases.
South: - Corresponding with the trends in the national map, Texas and Florida dominate much of the increases in the South. - The spread throughout the rest of the South is relatively uniform in the sense that during periods of increased spread, most Southern states see similar magnitudes of increases. - However, during the 2020-12 to 2021-03 wave, cases and deaths seemed to be more concentrated in Georgia, Tennessee, North Carolina, and Virginia.
West: - California more or less dominates most of the case spread in the West over the entire range of available time. - Arizona does also at times see considerable rises in deaths and cases as well. It is probably the “2nd hotspot” in the West.
Cluster analysis:
Overall, clustering by deaths or by cases seemed to have resulted in almost identical clusters, thus justifying the decision to cluster on one or the other.
The spatial distribution of the clusters supports the practice of dividing the US into its political geographical regions. The 6-cluster runs have clusters that correspond roughly to the Northeast, the South, and the West.
Investigating the “time series by cluster” plots: - All clustering runs seemed to have created the same temporal slice at around 2020-08. We see a peak before this date and a peak after, indicating that this clustering algorithm probably thinks this is about the right date to delineate the “first wave” and the “second wave.” - This lines up with what we saw in the heatmaps, as we saw the onset of a large increase in cases around that time.
When checking cases_6, the cases of cluster 1 go down while the cases of cluster 5 go up. These correspond, respectively, to the Northeast and the South.
This checks out with what we observed in the maps - we saw a strong initial wave in the Northeast, followed by increases in the South shortly after.
After this, though, the clusters for the 2nd wave all seem to uniformly go up, indicating that the 2nd wave was a nationwide event.
Investigating the within-cluster plots, - the 3-cluster runs seem to have middling results with respect to the time series: in cluster 1, for example, California appears to have had such a large increase in cases that it essentially gets its own cluster. However, the trends of all states within each cluster seem to generally correspond to one another.
Again, in the 6 cluster runs, California is very much a strong outlier within its cluster.
We see an interesting trend insofar as the “observed similarity” of within cluster trends - for those clusters that are temporally situated earlier in the pandemic, the within cluster variance seems to be larger (the series values have larger vertical distances).
With the exception of the cluster containing California, the other clusters later in the pandemic seem to have less within cluster variance.
This may support the hypothesis that over time, pandemic dynamics within the contiguous US become more homogeneous across the entire nation.
The trends observed in the cluster analysis correspond well to those observed using heatmapping, providing extra support for the observed patterns.
These patterns are an excellent stepping stone to subsequent analyses of state-level COVID data. Since this analysis is fundamentally exploratory, it has generated several questions that would be of interest to test in a confirmatory manner. - e.g Since this analysis has suggested that grouping states by their geographical region is indeed logical, we may wish to test that hypothesis with a statistical test (i.e investigate if there is indeed a group difference between regions) -
Ranking regions by record high monthly case volume in ascending order,
| region | monthly record high case increase |
|---|---|
| Midwest | 312,326 |
| Northeast | 404,981 |
| South | 639,843 |
| West | 1,249,607 |
Ranking top 10 states by record high monthly case volume,
Investigating hotspots by region,
National trends,
Regional trends,
The cases map:
Deaths map: - 2020-03-13 to 2020-04-13 - New York has the worst count of deaths by far. Other states in the Northeast, California are seeing rises as well. - Deaths continue to worsen in New York through the next month.
Overall at the national level, - New York was a hotspot of COVID activity during the earlier period of the pandemic in the US (from March 2020 to June 2020.) - Relatively speaking, deaths were very high during this period. - New cases and deaths slow from 05 2020 to 06-2020. - Then, in 07-2020 to 08-2020, New cases and deaths begin to increase especially in - California, - Texas, - and Florida. - This continues for a bit until 10-2020, where new cases and deaths begin to slow again.
Qualitatively speaking, the deaths and the cases charts coincide quite well. - It is known that there is a “lag time” between increases in cases and increases in deaths. This may be due to the time required to report the data as well as the time required for cases to develop. - This is especially evident from December 2020 to January 2021. The cases hit their peak in California in December, but the deaths do not peak until the month after, even though new cases in California go down during that period.